Categorize urban density

In this Python notebook, you'll create an interactive map of the United States that shows four levels of population density. You'll extract U.S. Census statistics on zip code areas, population counts, and median housing age. You'll join those statistics into a single DataFrame and calculate population density per square kilometer. Then you'll run some SQL-like queries on the DataFrame to classify the zip codes into the four categories of interest. Finally, you'll create an interactive map using Mapbox technology.

This notebook runs on Python 2.7 with Spark.

Overview

You'll use the categories to describe population density that are based on an academic study of urban structure and density, as described in the June 2014 article, From Jurisdictional to Functional Analysis of Urban Cores & Suburbs.

This article groups population into four categories that are based on population density and age of the houses:

  • Urban: The urban core that was established before 1940 and has a population density of > 2,900 people per square kilometer.
  • Auto suburban, early: The median house was built from 1946 to 1979 and a population density between < 2,900 people/sq. km and > 100 people/sq. km. Primarily single-family homes with a lot size that's usually around a 1/4 to 1/2 acre.
  • Auto suburban, later: The median house was built after 1979, and a population density between < 2,900 people/sq. km and > 100 people/sq. km. Single-family homes on 1 acre lots or larger.
  • Auto exurban: All others. Truly rural areas consisting primarily of 1+ acre residential lots, farms, and forests.

1. Import libraries

Import the pandas, numpy, and os libraries:

In [1]:
import pandas as pd, numpy as np, os
Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20180914113924-0002

2. Collect U.S. Census data

You'll use the U.S. Census data from the 2013 US Census American Community Survey (ACS), 5-year estimates.

You're using this particular version of the ACS for these reasons:

  • 5-year estimates are the most accurate data outside of the decennial census as explained here.
  • 2013 is the most recent data set with 5-year estimates.
  • This data set uses the zip code tabulation area (ZCTA), which provides the geographic boundaries of the zip codes so you can perform spatial analyses.
  • This data set is smaller than the full Census, but still has the important income, education, race, age, and occupation demographics that you'll want.

You'll get the data sets and combine them:
2.1 Get zip code areas
2.2 Get population and age by zip code
2.3 Get the housing age data
2.4 Join the data sets
2.5 Rename the columns

2.1 Get zip code areas

To get the zip code areas:

  1. Go to the ZIP Code tabulation areas (ZCTAs) page.
  2. Click the link icon.
  3. Copy the data access link.
  4. Replace the text "YOUR ACCESS CODE" in the next cell with your data access link.
In [2]:
GEO_URL = "YOUR ACCESS CODE"
geo_df = pd.read_csv( GEO_URL, usecols=['GEOID_Data','ALAND'], dtype={"GEOID_Data": np.str, "ALAND": np.int} )
geo_df.columns = ['GEOID','ALAND']
geo_df = geo_df.set_index('GEOID')
geo_df.head()
Out[2]:
ALAND
GEOID
86000US43451 63411475
86000US43452 121783676
86000US43456 9389361
86000US43457 48035540
86000US43458 2573816

2.2 Get population and age by zip code

Get a data access link for Population and age by zip code and paste it into the next cell.

In [3]:
POP_URL = "YOUR ACCESS CODE"
pop_df = pd.read_csv( POP_URL, usecols=['GEOID','B01001e1'], dtype={"GEOID": np.str} )
pop_df.columns = ['GEOID','POPULATION']
pop_df = pop_df.set_index('GEOID')
pop_df.head()
Out[3]:
POPULATION
GEOID
86000US01001 17245
86000US01002 29266
86000US01003 11032
86000US01005 5356
86000US01007 14673

2.3 Get the housing age data

Get a data access link for Housing (2015) and paste it into the next cell.

In [4]:
HOUSE_URL = "YOUR ACCESS CODE"
housing_df = pd.read_csv( HOUSE_URL, usecols=['GEOID','B25035e1'], dtype={"GEOID": np.str} )
housing_df = housing_df.set_index('GEOID')
housing_df.sample(5)
Out[4]:
B25035e1
GEOID
86000US46164 1976.0
86000US10167 NaN
86000US55772 1973.0
86000US65647 1979.0
86000US40972 1984.0

2.4 Join the data sets

Join the three data sets into a data set named urban_df:

In [5]:
urban_df = geo_df.join(pop_df)
urban_df = urban_df.join(housing_df)
urban_df.sample(5)
Out[5]:
ALAND POPULATION B25035e1
GEOID
86000US47388 748793 365.0 1953.0
86000US40040 53078785 456.0 1956.0
86000US64015 58835113 30828.0 1984.0
86000US12548 15205637 1380.0 1975.0
86000US78163 199805869 10278.0 1995.0

2.5 Rename the columns

Give the columns more meaningful names:

In [6]:
urban_df.columns = ['AREAMSQ','Population','MEDYRBUILT']
urban_df.sample(5)
Out[6]:
AREAMSQ Population MEDYRBUILT
GEOID
86000US49228 205775074 5735.0 1949.0
86000US79510 710413459 7973.0 1980.0
86000US72473 149629792 2604.0 1973.0
86000US65239 141087463 1237.0 1990.0
86000US43050 403861738 28612.0 1964.0

3. Calculate and group population density

You'll find the population density and assign a category for each area.

Calculate the population density per square kilometer: persons per square km = persons / (area in square meters / 1,000,000)

In [7]:
urban_df['POPPERKMSQ'] = urban_df['Population'] / (urban_df['AREAMSQ']/1000000)
urban_df.sample(4)
Out[7]:
AREAMSQ Population MEDYRBUILT POPPERKMSQ
GEOID
86000US08403 5434479 1066.0 1971.0 196.154958
86000US29907 108916064 12049.0 1993.0 110.626473
86000US28433 290190010 4955.0 1978.0 17.075019
86000US81422 694394451 520.0 1980.0 0.748854

Assign a category to each area based on the population density:

In [8]:
urban_df['CAT'] = 'EXURBAN'
urban_df['CAT'][(urban_df['POPPERKMSQ'] >= 2900)] = 'URBAN'
urban_df['CAT'][(urban_df['POPPERKMSQ'] < 2900) & (urban_df['POPPERKMSQ'] >= 100) & (urban_df['MEDYRBUILT'] < 1980) & (urban_df['MEDYRBUILT'] >= 1946)] = 'SUBURBANEARLY'
urban_df['CAT'][(urban_df['POPPERKMSQ'] < 2900) & (urban_df['POPPERKMSQ'] >= 100) & (urban_df['MEDYRBUILT'] >= 1980)] = 'SUBURBANLATE'
urban_df.describe()
/opt/ibm/conda/miniconda2/lib/python2.7/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
/opt/ibm/conda/miniconda2/lib/python2.7/site-packages/ipykernel/__main__.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
/opt/ibm/conda/miniconda2/lib/python2.7/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Out[8]:
AREAMSQ Population MEDYRBUILT POPPERKMSQ
count 3.314400e+04 32989.000000 32102.000000 32989.000000
mean 2.242562e+08 9443.177453 1971.867298 487.536570
std 6.562548e+08 13858.010530 15.678772 1909.973331
min 5.094000e+03 0.000000 1939.000000 0.000000
25% 2.341591e+07 719.000000 1962.000000 7.754555
50% 9.283519e+07 2781.000000 1974.000000 30.194358
75% 2.292059e+08 12830.000000 1983.000000 249.247366
max 3.478591e+10 114734.000000 2011.000000 71226.281507

Look at a few records to do a quick sanity check:

In [9]:
urban_df.sample(10)
Out[9]:
AREAMSQ Population MEDYRBUILT POPPERKMSQ CAT
GEOID
86000US36727 158995027 168.0 1995.0 1.056637 EXURBAN
86000US78941 404766109 2754.0 1984.0 6.803929 EXURBAN
86000US85551 138832155 437.0 1975.0 3.147686 EXURBAN
86000US33884 51902876 28995.0 1989.0 558.639564 SUBURBANLATE
86000US11964 24189418 1846.0 1976.0 76.314362 EXURBAN
86000US28311 97170665 33727.0 1987.0 347.090349 SUBURBANLATE
86000US84772 17614144 103.0 1971.0 5.847573 EXURBAN
86000US89422 1140415940 82.0 1976.0 0.071904 EXURBAN
86000US95919 64005514 1305.0 1978.0 20.388868 EXURBAN
86000US97347 111446076 1726.0 1991.0 15.487311 EXURBAN

4. Prepare the data for visualization

You'll convert the data to JSON format and create a JavaScript variable to visualize the data in a browser.

Convert the data to JSON format:

In [10]:
json_data_from_python = urban_df.reset_index().to_json(orient="records")

Create a JavaScript variable called vizObj for your JSON data. The data object vizObj is a global variable in your window that you could pass to another JavaScript function call.

In [11]:
from IPython.display import Javascript

Javascript("""window.vizObj={};""".format(json_data_from_python))
Out[11]:

5. Create the map

Run the next cell to create an interactive map using Mapbox. A Mapbox access token and mapID are supplied for you. You can substitute your own access token and mapID if you have them.

In [12]:
%%javascript
require.config({
  paths: {
      mapboxgl: 'https://api.tiles.mapbox.com/mapbox-gl-js/v0.39.1/mapbox-gl',
      bootstrap: 'https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/js/bootstrap.min'
  }
});

IPython.OutputArea.auto_scroll_threshold = 9999;
require(['mapboxgl', 'bootstrap'], function(mapboxgl, bootstrap){
    mapboxgl.accessToken = 'pk.eyJ1IjoicmFqcnNpbmdoIiwiYSI6ImpzeDhXbk0ifQ.VeSXCxcobmgfLgJAnsK3nw';
    var map = new mapboxgl.Map({
        container: 'map', // HTML element id in which to draw the map
        style: 'mapbox://styles/mapbox/light-v9', //mapbox map to use
        center: [-71.09, 42.44], // starting center position
        zoom: 9 // starting zoom
    });
    
    var densitytypes = ["URBAN", "SUBURBANEARLY", "SUBURBANLATE", "EXURBAN"];
    var densitycolors = ["#d7301f", "#fc8d59", "#fdcc8a", "#fef0d9"];
    
    // Join local JSON data with vector tile geometry
    var maxValue = 71227;
    var data = vizObj;

    // Get the vector geometries to join
    // US Census Data Source
    // https://www.census.gov/geo/maps-data/data/cbf/cbf_state.html
    var mapId = "rajrsingh.bjb1ffhz";
    var layerName = "zipsimple0025-btzfjd";
    var vtMatchProp = "GEOID_Data";
    var dataMatchProp = "GEOID";
    var dataStyleProp = "CAT";


    map.on('load', function() {

        // Add source for state polygons hosted on Mapbox
        map.addSource("zips", {
            type: "vector",
            url: "mapbox://" + mapId
        });

        // First value is the default, used where there is no data
        var stops = [["0", "#888888"]];

        // Calculate color for each state based on the unemployment rate
        data.forEach(function(row) {
            if (densitytypes.indexOf(row[dataStyleProp]) >= 0 ) {
                var color = densitycolors[densitytypes.indexOf(row[dataStyleProp])];
                stops.push([row[dataMatchProp], color]);
            }
        });

        // Add layer from the vector tile source with data-driven style
        map.addLayer({
            "id": "zips-join",
            "type": "fill",
            "source": "zips",
            "source-layer": layerName,
            "paint": {
                "fill-color": {
                    "property": vtMatchProp,
                    "type": "categorical",
                    "stops": stops
                }, 
                "fill-antialias": true, 
                "fill-outline-color": "rgba(255, 255, 255, 1)"
            }
        }, 'waterway-label');
    });
});

element.append('<div id="map" style="position:relative;top:0;bottom:0;width:100%;height:400px;"></div>');

The map is centered on the Boston area. You can zoom and pan the map to see any area of the United States. Note: the generated map is not available in preview mode.

Summary and next steps

Views from around the country each tell different stories about the composition of urban areas. Combine this map with your own data to discover deeper insights into customers or constituents.

Learn more:

Author

Raj Singh is a Developer Advocate and Open Data Lead at IBM Cloud Data Services. He specializes in all things geospatial and hacks on analytics in R/IBM Db2 Warehouse on Cloud and Spark/iPython notebooks.


Copyright © IBM Corp. 2018. This notebook and its source code are released under the terms of the MIT License.

Love this notebook? Don't have an account yet?
Share it with your colleagues and help them discover the power of Watson Studio! Sign Up